Project Overview

  • Goal: Construct a model that accurately predicts whether an individual earns more than $50k/yr
  • Motivation: Help a non-profit identify donor candidates and understand how large a donation to request
  • Data Source: 1994 US Census data, from the UCI Machine Learning Repository*

Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.

In [1]:
import numpy as np                                # Package for numerical computing with Python
import pandas as pd                               # Package to work with data in tabular form and the like
from scipy.stats import skew
from time import time                             # Package to work with time values
from IPython.display import display               # Allows the use of display() for DataFrames
import matplotlib.pyplot as plt                   # Package for plotting
import seaborn as sns                             # Package for plotting, prettier than matplotlib
import visuals as vs                              # Adapted from Udacity
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
In [2]:
# iPython Notebook formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Account for changes made to imported packages
%load_ext autoreload
%autoreload 2
In [3]:
data = pd.read_csv("census.csv")

Featureset Exploration

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
In [4]:
data.info(show_counts=True)   # Show each column's dtype and non-null count (null_counts was renamed show_counts in pandas 1.2)
pd.DataFrame(data.isna().sum()).T
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              45222 non-null  int64  
 1   workclass        45222 non-null  object 
 2   education_level  45222 non-null  object 
 3   education-num    45222 non-null  float64
 4   marital-status   45222 non-null  object 
 5   occupation       45222 non-null  object 
 6   relationship     45222 non-null  object 
 7   race             45222 non-null  object 
 8   sex              45222 non-null  object 
 9   capital-gain     45222 non-null  float64
 10  capital-loss     45222 non-null  float64
 11  hours-per-week   45222 non-null  float64
 12  native-country   45222 non-null  object 
 13  income           45222 non-null  object 
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
Out[4]:
age workclass education_level education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [5]:
data.describe(include='all').T    # Summarize each factor, transpose the summary (personal preference)
Out[5]:
count unique top freq mean std min 25% 50% 75% max
age 45222 NaN NaN NaN 38.5479 13.2179 17 28 37 47 90
workclass 45222 7 Private 33307 NaN NaN NaN NaN NaN NaN NaN
education_level 45222 16 HS-grad 14783 NaN NaN NaN NaN NaN NaN NaN
education-num 45222 NaN NaN NaN 10.1185 2.55288 1 9 10 13 16
marital-status 45222 7 Married-civ-spouse 21055 NaN NaN NaN NaN NaN NaN NaN
occupation 45222 14 Craft-repair 6020 NaN NaN NaN NaN NaN NaN NaN
relationship 45222 6 Husband 18666 NaN NaN NaN NaN NaN NaN NaN
race 45222 5 White 38903 NaN NaN NaN NaN NaN NaN NaN
sex 45222 2 Male 30527 NaN NaN NaN NaN NaN NaN NaN
capital-gain 45222 NaN NaN NaN 1101.43 7506.43 0 0 0 0 99999
capital-loss 45222 NaN NaN NaN 88.5954 404.956 0 0 0 0 4356
hours-per-week 45222 NaN NaN NaN 40.938 12.0075 1 40 40 45 99
native-country 45222 41 United-States 41292 NaN NaN NaN NaN NaN NaN NaN
income 45222 2 <=50K 34014 NaN NaN NaN NaN NaN NaN NaN
In [6]:
n_records = data.shape[0]                                                   # First element of .shape indicates n
n_greater_50k = data[data['income'] == '>50K'].shape[0]                     # n of those with income > 50k
n_at_most_50k = data[data['income'] == '<=50K'].shape[0]                    # n of those with income <= 50k
greater_percent = round((n_greater_50k / n_records)*100,2)                  # Show proportion of > 50k to whole data

data_details = {"Number of observations": n_records,
                "Number of people with income > 50k": n_greater_50k,
                "Number of people with income <= 50k": n_at_most_50k,
                "Percent of people with income > 50k": greater_percent}     # Cache values of analysis

for item in data_details:                                                   # Iterate through the cache
    print("{0}: {1}".format(item, data_details[item]))                      # Print the values
Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78
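
Given this class imbalance, a quick sanity check (not part of the cells above) is the naive baseline: a "model" that labels every individual as earning more than 50k. Its metrics follow directly from the counts just printed and set the floor any real classifier must beat. This is a sketch using the cached values; it assumes ">50K" is treated as the positive class.

```python
# Naive baseline: predict ">50K" for every record.
# Counts are taken from the summary printed above.
n_records = 45222
n_greater_50k = 11208

accuracy = n_greater_50k / n_records   # TP / (TP + FP), since everything is flagged positive
recall = 1.0                           # every actual ">50K" record is caught
precision = accuracy                   # same ratio as accuracy for an all-positive predictor

print(f"Naive baseline accuracy:  {accuracy:.4f}")
print(f"Naive baseline precision: {precision:.4f}")
print(f"Naive baseline recall:    {recall:.4f}")
```

Any trained model should comfortably exceed this ~24.78% precision while keeping recall well above zero.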

Data Preprocessing

Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.

Split the data into features and labels

In [7]:
income_raw = data['income']
features_raw = data.drop('income', axis=1)

Skew

The features capital-gain and capital-loss are positively skewed (i.e. have a long tail in the positive direction).

To reduce this skew, a logarithmic transformation, $\tilde{x} = \ln\left(x + 1\right)$, can be applied (the $+1$ offset keeps the many zero-valued records defined, since $\ln 0$ is undefined). This transformation compresses the variance and pulls the mean closer to the center of the distribution.

Why does this matter? The observed sample may be a poor approximation of the population.

Asymptotic normality means that the sampling distribution of an estimator approaches a normal distribution as the sample size grows, even if the population itself is skewed.

  • Per the central limit theorem, as the sample size $n \rightarrow \infty$, the sampling distribution of the sample mean becomes approximately normal even though the population distribution is skewed.
  • When $n$ is large, asymptotic theory provides a more complete picture of the "accuracy" of an estimator such as $\hat{\lambda} = \bar{X}$ for $X_i \sim \mathrm{Poisson}(\lambda)$: by the Law of Large Numbers, $\bar{X}$ converges to $\lambda$ in probability as $n \rightarrow \infty$. Furthermore, by the Central Limit Theorem, $$\sqrt{n}(\bar{X} - \lambda) \rightarrow N(0, \mathrm{Var}[X_{i}]) = N(0, \lambda)$$ in distribution as $n \rightarrow \infty$. So for large $n$, we expect $\hat{\lambda}$ to be close to $\lambda$, and the sampling distribution of $\hat{\lambda}$ is approximately $N\left(\lambda, \frac{\lambda}{n}\right)$

Source: Stanford, Introduction to Statistical Inference
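
To make the CLT claim above concrete, a small simulation (not part of the original analysis) can draw many samples from a skewed Poisson population and confirm that the sample means cluster around $\lambda$ with variance close to $\lambda/n$. The parameter values here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, n_samples = 3.0, 500, 20_000

# Draw n_samples independent samples of size n from a skewed Poisson(lam)
# population, and record the mean of each sample.
sample_means = rng.poisson(lam, size=(n_samples, n)).mean(axis=1)

# CLT prediction: sample means ~ N(lam, lam/n) for large n.
print(f"mean of sample means: {sample_means.mean():.4f}  (lambda   = {lam})")
print(f"var  of sample means: {sample_means.var():.6f}  (lambda/n = {lam/n:.6f})")
```

Both printed values should land very close to their theoretical targets, even though a single Poisson(3) sample is visibly right-skewed.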

In [8]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(
    go.Histogram(x=data['capital-loss'], nbinsx=25,
    name='Capital-Loss'),
    row=1, col=1

)

fig.add_trace(
    go.Histogram(x=data['capital-gain'], nbinsx=25,
    name='Capital-Gain'),
    row=2, col=1
)

fig.update_xaxes(title_text="Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Capital-Gain Feature Distribution", row=2, col=1)
fig.update_yaxes(title_text="Number of Records", range=[0, 2000], row=1, col=1)
fig.update_yaxes(title_text="Number of Records", range=[0, 2000], row=2, col=1)

fig.update_layout(height=800, width=1000,
                  title_text="Skewed Distributions of Continuous Census Data Features",
                  showlegend=False,
                  yaxis = dict(
                      tickmode = 'array',
                      tickvals = [0, 500, 1000, 1500, 2000],
                      ticktext = [0, 500, 1000, 1500, ">2000"]
                  )
                 )

fig.show()
In [9]:
cap_loss_skew = skew(data['capital-loss'])
cap_loss_var = np.var(data['capital-loss'])
cap_loss_mean = np.mean(data['capital-loss'])
cap_gain_skew = skew(data['capital-gain'])
cap_gain_var = np.var(data['capital-gain'])
cap_gain_mean = np.mean(data['capital-gain'])
fac_df = pd.DataFrame({'Feature': ['Capital Loss', 'Capital Gain'],
              'Skewness': [cap_loss_skew, cap_gain_skew],
              'Mean': [cap_loss_mean, cap_gain_mean],
              'Variance': [cap_loss_var, cap_gain_var]})
display(fac_df)
Feature Skewness Mean Variance
0 Capital Loss 4.516154 88.595418 1.639858e+05
1 Capital Gain 11.788611 1101.430344 5.634525e+07

Apply the logarithmic transformation:

In [10]:
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data=features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x : np.log(x + 1))
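
A side note on the transform above: NumPy's np.log1p computes the same $\ln(x+1)$ with better precision near zero, and np.expm1 inverts it exactly, so the original values remain recoverable. The sketch below demonstrates this on synthetic data built to mimic a zero-heavy, right-skewed feature like capital-gain (the mixture proportions and lognormal parameters are illustrative assumptions, not fitted to the census data).

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic stand-in for a zero-heavy, right-skewed feature: ~90% zeros,
# ~10% large positive values drawn from a lognormal distribution.
rng = np.random.default_rng(42)
x = pd.Series(np.where(rng.random(10_000) < 0.9, 0.0,
                       rng.lognormal(8, 1, 10_000)))

x_log = np.log1p(x)   # equivalent to np.log(x + 1), but precise for tiny x

print(f"skew before: {skew(x):.2f}, skew after: {skew(x_log):.2f}")

# The transform is invertible, so no information is lost:
assert np.allclose(np.expm1(x_log), x)
```

The skew drops substantially but does not vanish: the spike of zeros keeps the transformed distribution asymmetric, just as in the log-transformed census features plotted below.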
In [11]:
fig = make_subplots(rows=2, cols=1)

fig.update_layout(height=800, width=1000,
                  title_text="Skewed Distributions of Continuous Census Data Features",
                  showlegend=False
                 )

fig.add_trace(
    go.Histogram(x=features_log_transformed['capital-loss'], nbinsx=25,
    name='Log of Capital-Loss'),
    row=1, col=1

)

fig.add_trace(
    go.Histogram(x=features_log_transformed['capital-gain'], nbinsx=25,
    name='Log of Capital-Gain'),
    row=2, col=1
)

fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=2, col=1)
fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                 patch = dict(
                     tickmode = 'array',
                     tickvals = [0, 500, 1000, 1500, 2000],
                     ticktext = [0, 500, 1000, 1500, ">2000"]),
                 row=1, col=1)

fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                 patch = dict(
                     tickmode = 'array',
                     tickvals = [0, 500, 1000, 1500, 2000],
                     ticktext = [0, 500, 1000, 1500, ">2000"]),
                 row=2, col=1)

fig.show()
In [12]:
log_cap_loss_skew = skew(features_log_transformed['capital-loss'])
log_cap_loss_var = round(np.var(features_log_transformed['capital-loss']),5)
log_cap_loss_mean = np.mean(features_log_transformed['capital-loss'])
log_cap_gain_skew = skew(features_log_transformed['capital-gain'])
log_cap_gain_var = round(float(np.var(features_log_transformed['capital-gain'])),5)
log_cap_gain_mean = np.mean(features_log_transformed['capital-gain'])
log_fac_df = pd.DataFrame({'Feature': ['Log Capital Loss', 'Log Capital Gain'],
              'Skewness': [log_cap_loss_skew, log_cap_gain_skew],
              'Mean': [log_cap_loss_mean, log_cap_gain_mean],
              'Variance': [log_cap_loss_var, log_cap_gain_var]})
fac_df = pd.concat([fac_df, log_fac_df], ignore_index=True)   # DataFrame.append was removed in pandas 2.0
fac_df['Variance'] = fac_df['Variance'].apply(lambda x: '%.5f' % x)
display(fac_df)
Feature Skewness Mean Variance
0 Capital Loss 4.516154 88.595418 163985.81018
1 Capital Gain 11.788611 1101.430344 56345246.60482
2 Log Capital Loss 4.271053 0.355489 2.54688
3 Log Capital Gain 3.082284 0.740759 6.08362

The logarithmic transformation reduced the skew and the variance of each factor.

In [14]:
# # Full Page - Code
!jupyter nbconvert Polishing_Donor_Classification.ipynb --output WIP_Class_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none
# # Full Page - No Code
!jupyter nbconvert Polishing_Donor_Classification.ipynb --output WIP_Class_No_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none --TemplateExporter.exclude_input=True
# # Slides - No Code
!jupyter nbconvert --to slides Polishing_Donor_Classification.ipynb --output WIP_Class_No_Code_Slides --TemplateExporter.exclude_input=True --SlidesExporter.reveal_transition=none
[NbConvertApp] Converting notebook Polishing_Donor_Classification.ipynb to html
[NbConvertApp] Writing 4675068 bytes to WIP_Class_Code.html
[NbConvertApp] Converting notebook Polishing_Donor_Classification.ipynb to html
[NbConvertApp] Writing 4642221 bytes to WIP_Class_No_Code.html
[NbConvertApp] Converting notebook Polishing_Donor_Classification.ipynb to slides
[NbConvertApp] Writing 4646670 bytes to WIP_Class_No_Code_Slides.slides.html